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Abstract 

We use the minorization-maximization principle (Lange, Hunter and Yang 2000) to es- 
tablish the monotonicity of a multiplicative algorithm for computing Bayesian D-optimal 
designs. This proves a conjecture of Dette, Pepelyshev and Zhigljavsky (2008). 
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Multiplicative algorithms (Silvey, Titterington and Torsney 1978; Torsney and Mandal 2006; 
Dette, Pepelyshev and Zhigljavsky 2008) are often employed in numerical computation of op- 
timal designs (approximate theory; see Kiefer 1974, Silvey 1980, and Pukelsheim 1993). These 
iterative algorithms are simple, easy to implement, and often increase the optimality criterion 
monotonically. In the case of D-optimality, for example, monotonicity of the algorithm of Silvey 
et al. (1978) is well known (Titterington 1976; Pazman 1986); see Yu (2010a) and the references 
therein for further results. Monotonicity is an important property as it implies convergence under 
mild conditions. 

Bayesian D-optimality is a widely used design criterion that can accommodate prior uncertainty 
in the parameters (see Chaloner and Larntz 1989 and Chaloner and Verdinelli 1995). Multiplica- 
tive algorithms extend naturally from D-optimality to Bayesian D-optimality (Dette et al. 2008). 
Although the form of the algorithms is just as simple as in the D-optimal case, a corresponding 



monotonicity result is still lacking. In the context of nonlinear regression, Dette et al. (2008) 
conjecture the monotonicity of a class of algorithms for computing Bayesian D-optimal designs. 
The main theoretical contribution of this work is to confirm their monotonicity conjecture. 

Our technical devices include convexity and the minorization-maximization principle (MM; 
Lange, Hunter and Yang 2000; Hunter and Lange 2004). Similar ideas play a key role in settling 
the related Titterington's (1978) conjecture (see Yu 2010a, 2010b). Minorization-maximization 
(or bound optimization) is a general method for constructing iterative algorithms that increase an 
objective function 4>(w) monotonically. We first construct a function Q{w\ w) such that <p(w) > 
Q(w;w) for all w and w, and (j)(w) = Q(w;w). Suppose the current iterate is w^\ We choose 
w ( t + 1 ) to increase the Q function, i.e., 

Q («,(*+!);«,(')) >g («,(*);«,(')). (1.1) 

Then w^ t+1 ' also increases the objective function 0, because 

(«/ t+1) ) > Q («; (m) ; w®) > Q (wW; w®) = (w (<) ) . 

The usual MM algorithm chooses w^ t+1 > to maximize Q (•; w®). Since we only require (11.11) . it is 
proper to call this strategy a general MM algorithm. The general MM algorithm is an extension 
of the general expectation- maximization algorithm (GEM; Dempster, Laird and Rubin 1977). 

In Section 2 we state our monotonicity result and illustrate with a simple logistic regression 
example. Section 3 proves the monotonicity result. Specifically, the algorithm of Dette et al. 
(2008) for computing Bayesian D-optimal designs is derived as a general MM algorithm. 

2 Theoretical Result and Illustration 

We focus on a finite design space X = {xi, . . . , x n }. Let 9 be the mxl parameter of interest, and 
let Ai(9) denote the m x m Fisher information matrix provided by a unit assigned to design point 
Xi. The so-called Bayesian D-optimality (Chaloner and Larntz 1989) seeks to maximize 

(f>(w) = J log det M(w, 9) dir(6), 

where tt(9) is a probability distribution representing prior knowledge about 9, and 

n 
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This is an extension of local D-optimality which chooses the design weights Wi to maximize the log- 
determinant of the Fisher information for a fixed 9. It can also be viewed as a large sample approx- 
imation to Lindley's (1956) criterion based on Shannon information. Here w = (w±, . . . ,w n ) G Q, 
and fi denotes the closure of Q = {w : YH=\ w i = 1> w % > 0}- To convert w to a finite-sample 
design, some rounding procedure is needed (Pukelsheim 1993, Chapter 12). The matrices Ai(9) 
are assumed to be well defined and nonnegative definite for every 9. 
Let us consider the following algorithm for maximizing 4>(w). Define 



d t (w)= tr(M- 1 (w,9)A t (9))dn(9). 
Algorithm I 
Set «j{°' = (w ( °\ • • • , w ( n } ) e Q. That is, wf ] > for all i. 
For t = 0, 1, . . ., compute 



u;) ; =w^ ; ^ , i = l,...,n, (2.1, 



m-«W ' 



where a^ satisfies 



In 

a w < -minrfi^W). (2.2) 



Iterate until convergence. 

A commonly used convergence criterion is 

maxdi(w {t) ) < m + e, (2.3) 



where e is a small positive constant. This is based on the general equivalence theorem (Kiefer and 
Wolfowitz 1960; Whittle 1973), which characterizes any maximizer of <f>(w), w, by max" =1 di(w) = 
m. 

Algorithm I slightly generalizes the one proposed by Dette et al. (2008). In a regression 
context, Dette et al. (2008) prove that Algorithm I is monotonic for D-optimality, i.e., when 7r(9) 
is a point mass. Numerical examples support the conjecture that Algorithm I is monotonic for 
Bayesian D-optimality in general. We shall confirm this conjecture (Theorem [T]). 



Theorem 1. Assume 4>{w) is finite for at least one w E VL. Let w^\w^ t+l ' E fi satisfy Ii2. 1\) and 
h2.2\) . Then we have 

with equality only ifw^ t+1 ' = w® . 

Once strictly monotonicity is established, global convergence holds under mild conditions. We 
state such a result where a® takes a convenient parametric form. 

Theorem 2. Assume (j>(w) is finite for at least one w E Q. Let w® be a sequence generated by 
\2. 1\) , starting with w^ E Q. Assume 

n n 

a {t) = - mm di(w {t) ), (2.4) 

where a E [0, 1] is a constant. Then all limit points of tyw are global maxima of <j>(w) on Cl. 

Note that a limit point of w® may have some zero coordinates, although we require the 
starting value w\ > for all i. Also, a® changes from iteration to iteration. Nevertheless, based 
on Theorem [H Theorem [2] can be established by an argument similar to that of Theorem 2 of Yu 
(2010a) (details omitted). 

A natural question is the choice of a®. Given similar computing costs per iteration, it is 
reasonable to choose a® based on the convergence rate. For D-optimal designs, i.e., when tt(9) is 
a point mass, Yu (2010b) analyzes the convergence rate of Algorithm I. We expect the results to 
carry over to Bayesian D-optimality. Specifically, treating the a" =0 case as the basic algorithm, 
we can view Algorithm I with a® > as an overrelaxed version (in the sense of successive 
overrelaxation in numerical analysis; see Young 1971). At each iteration, overrelaxation multiplies 
the step length of the basic algorithm by m/(m — a®). Noting Y^h=i ^i( w ) w i = m an d ( )2.2p . 
we have a® < m/2. That is, m/(m — a®) < 2. Thus, roughly speaking, overrelaxation can at 
most double the speed of the basic algorithm. A caveat is that, when the basic algorithm is very 
fast, overrelaxation may slow it down due to overshooting. Nevertheless, the examples provided 
by Dette et al. (2008) indicate that this rarely happens in practice. The slowness of the basic 
algorithm is usually the main concern. 

For the rest of this section we illustrate our theoretical results with a logistic regression model 

Pr(y = l\x,9) = 1 -Pi(y = 0\x,9) = (l + exp (-x T 9)y 1 . 
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Table 1: Iteration counts for Algorithm I with ay' specified by ( 12. 4p . 



e = 10~ 3 
e = 10- 4 



a = a = 1/4 a = 1/2 a = 3/4 a = l 



929 823 718 613 507 

4112 3643 3175 2706 2238 



More examples can be found in Dette et al. (2008). Consider the design space 

X x = {xi = (1, i/10 - 1) T : z = 1, . . . , 30} . 
The Fisher information for 9 from a unit assigned to X{ is 

A i {9)=x i x[ " Xp( f 77^^. 

(l + exp^)) 2 

Suppose the distribution 7r(0) assigns probability 1/25 to each point in the following set 

{(*,j) T : z,j = -2,-1,0,1,2}. 

We implement Algorithm I to compute the Bayesian D-optimal design. The ay> is specified by (12. 4p 
with several choices of a. Each algorithm is started at the uniform design -u/°) = (1/30, . . . , 1/30), 
and we consider two convergence criteria corresponding to (12. 3D with e = 10~ 3 and e = 10~ 4 
respectively. Table 1, which records the iteration counts, shows the advantage of using larger a 
(a < 1). The large iteration counts when e = 10~ 4 illustrate the potential slow convergence of 
Algorithm I. We also display the optimality criterion <p (w^') in Figure 1. As Theorem [1] claims, 
4> (w^) increases monotonically for each algorithm. 

Table 2 records the design weights as calculated by Algorithm I with a = 1. Note that, 
as the more stringent criterion e = 10~ 4 is adopted, the weights assigned to the middle cluster 
Xi, i = 14, . . . , 18, become more concentrated around xie- One interpretation is that Algorithm I 
sometimes has difficulty apportioning mass among adjacent design points, and therefore the con- 
vergence is slow. This also hints at potential remedies for the slow convergence. For computing 
D-optimal designs, Yu (2010c) proposes a "cocktail algorithm" that combines three different strate- 
gies for fast monotonic convergence. One of the ingredients, a special case of Algorithm I, is a 
multiplicative algorithm (Silvey et al. 1978). Another ingredient is a strategy that facilitates mass 
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Figure 1: Monotonicity of (iw^J for Algorithm I. 
Table 2: Output (design weights) of Algorithm I with a = 1 and two convergence criteria. 



design points 


Xi 


x u 


«15 


a?i6 


Xvt 


«i8 


^30 


output (e = 10~ 3 ) 
output (e = 10~ 4 ) 


0.434 
0.435 


0.006 
0.000 


0.073 
0.026 


0.114 
0.204 


0.035 
0.002 


0.003 
0.000 


0.334 
0.334 



exchange between adjacent design points. There is no conceptual problem extending the cocktail 
algorithm to Bayesian D-optimality. We are investigating such extensions and will report the 
findings in future works. 

3 Monotonicity of Algorithm I 

This section proves Theorem [U We need Lemma dj which slightly extends Lemma 1 of Dette et 
al. (2008). 



Lemma 1. For fixed 6, detM(w,6) is a polynomial in wi, . . . ,w n with nonnegative coefficients. 



Proof. Let I m denote the m x m identity matrix, and define an m x (mn) matrix G by 

G=(G l ,...,G mn )=(A{ /2 (e),...,A 1 n / 2 (e)). 

We have 

M(w, 9) = G(Diag(w) <g> J m )G T . 

The Cauchy-Binet formula (Horn and Johnson 1990, Chapter 0) yields 

det M(w,6)= ^ h{h,---,im)u h ---u im , 

l<il<---< i m <mn 

where h(i\, . . . , i m ) = det 2 (Gj i; . . . , Gi m ), and ui denotes the ith diagonal of Diag(w) ® I m . The 
claim holds because Ui is equal to one of Wj, and h{i\, . . . , i m ) > 0. □ 

Lemma [2] serves as a building block for constructing our minorization-maximization strategy. 
Lemma 2. Let g(w) be a nonzero polynomial in w = (wi, . . . ,w„) with nonnegative coefficients. 



Define 



n , ~n v^ dg(w) Wi 

Q{w;w) = ) — — -logWi, w,weS2. 



Then we have 

Q(w; w) — Q(w; w) < logg{w) — logg{w). 

Proof. Write g{w) = X/i=i c jfj( w ) where Cj > and fj(w) are monomials in w. We have 

i=l 4 

because fj are monomials. Multiplying both sides by Cj/g(w) and then summing over j yield 

jr[ g(w) 

Hence 

„, ~ N „,_ ~ N . / \ , /~\ v^ CjfAw) , CjfJw)/g(w) 
Q(w; w) - Q(w,w) - logg(w) + logg(w) = ^ Jjj l ; log "•' 



^ g(w) CjfAw)/ g{w) 



< 



log e 



c i/i( w ) 



,. , ^iwj 



= 0, 
where the inequality holds by Jensen's inequality applied to the concave function log x. □ 



Lemma [3] is implicit in Dette et al. (2008). The proof is included for completeness. 

Lemma 3. Define Q(w) = J^ILi q% log u>j, q, w G f2. For a /ixed w, let a be a scalar that satisfies 

1 n . Qi 
a < — mm — . 

2 i=l W{ 

Then we have 

A q — aw 

Q{w)>Q(w), w = - , 

1 — a 

with equality only if w = w. 
Proof. Letting r^ = qi/wi, we have 

n 

Q(w) - Q(w) = 22 W i r i lo § T 



n - a 



a 

f — a 
> r log 

1 — a 

= 0, 

where f = J^jWjrj = 1 (hence the last equality), and the inequality follows by Jensen's inequality 
applied to the function xlog(x — a), which is convex on x > max{0, 2a}. Hence Q(w) > Q{w). 
By strict convexity, equality holds only when r« = f = 1 for all i, i.e., when w = w. □ 

Proof of TheoremUl It is easy to see that, if 4>(w) is finite for any w E Q, then it is finite for all 
w G Q. Define g(w,9) = detM(w,9). Because <f>(w) is finite, we have g(w,9) > almost surely 
with respect to 7f(6). By Lemma [TJ for fixed 9, g(w,9) is a polynomial in w with nonnegative 
coefficients. Define (w, w G fi) 

Q{w;w\9) = > — ——- \ogWi 

^ dwi g{w, 9) 



= ^2tr{M- 1 (w,9)A i (9))w i \ogw 

8=1 

By Lemma [2], we have 

Q(w; w\9) — Q(w; w\9) < log g(w, 9) — \ogg(w 



Integration yields 

n „ 

Y^di{w)Wi\og^-= / [Q(w;w\6)-Q(w;w\e)} dn(6) 
1=1 * J 



< / i\ogg(w,9)-hgg(w,6)}dir(6) 



[W] — 0(W). 



That is, the function 

Q(w; w) = 22 di{w)wi log -^ + 0(w) 



=1 * 



satisfies Q(w; w) < <f)(w) and Q(w; w) = <f)(w). This forms the basis of minorization-maximization. 
Suppose w''',w' f+1 ' G fl are related by (12.11) . Applying Lemma [3] with qi = d{ (w®) w\ /m 
(note that Ym=\ 9* = -0> we § e ^ 

Q (w( t+1 V* } ) > Q (w (t V t} ) , (3.1) 

as long as (12.21) holds. Thus (w^ t+1 ^) > Q (m/ 4-1-1 * 1 ; tu^) > Q (?«'''; tw^) = </> (u>^) , and mono- 
tonicity is proved. Lemma [3] implies that equality holds in (13. ip only when w^ t+1 ^ = w®. Hence 
the monotonicity is strict. D 

Remark 1. Theorem [1] assumes that w^\w^ t+1 ^ G f2, i.e., they have all positive coordinates. 
This assumption can be relaxed. Inspection of the above proof shows that strict monotonicity 
holds as long as w® G 0, and (p (w^) is finite. 

Remark 2. The arguments of Yu (2010a), based on two layers of auxiliary variables, can 
be extended to prove the monotonicity of (12. ip assuming a® = 0. This is however weaker than 
Theorem [1] in the present form. 
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